Spelling error detector apparatus and methods

ABSTRACT

A spelling error detector apparatus employs a memory which stores alphabetical words, such as those existing in the English language, in three major memory sections which constitute a most frequently used word list (MFU), a master word list (MWL) and a personal word list (PWL). Each word is uniquely coded and stored as a 24 bit binary number. The system then retrieves words which are stored or entered into a processor memory and which are indicative of a document to be printed. Each word emanating from the processor memory is converted into the same code as the stored words and then a search is made to determine whether the processor word compares with a word as stored. If a favorable comparison is had, it is assumed that the spelling of the processor word is correct. If an unfavorable comparison is had, it is assumed that the spelling is incorrect and the misspelled word is stored in a separate memory which can be accessed by the operator in order to make the necessary corrections. Similarly, each word which is correctly spelled is also stored in a memory which has the capacity to store a plurality of the last words checked by the system. Based on the system considerations, the coding of the words assures a very low collision rate and hence the system is extremely reliable in detecting misspelled words according to the disclosed techniques.

BACKGROUND OF INVENTION

This invention relates to apparatus for detecting misspelled words in electronically stored documents and more particularly relates to a spelling error detector and methods of operation.

In modern technology there are many devices which store documents electronically and when accessed will create a printed image of the stored data. As such, these devices include word processors.

The word processor is a machine on which a printed image can be corrected and manipulated before it is printed out in final form. The word processor uses computer technology to operate with words. There are of course many other types of systems which essentially store data in the form of words and subsequently can print out the data. These systems include data processing computers and so on.

Modern word processors utilize four basic elements: a visual display unit with an input keyboard, a memory, a text storage medium and a printer. The combination of a typewriter keyboard and the visual display unit is generally referred to as the work station. The display enables the operator or typist to see the text before it is finally printed. Displays vary from single line displays to full page displays. As indicated, every word processor has an internal memory unit where the text or words are stored and manipulated. The space available in the word processor memory for the text is normally not very large and in most word processors the memory can only hold one or two pages of text. Hence in more sophisticated units, additional pages of text are transferred into a remote memory designated as a text storage medium. This additional memory usually consists of a cassette tape, a floppy disk or diskette.

Essentially, the most common device employed is a diskette or floppy disk and this memory device can hold between 80 and 160 typical pages of text. Of course new developments are continuously being made and there exist hard disk memories which permit higher storage levels and faster access times. In any event, there is a need in conjunction with such equipment to detect misspelled words in documents which are stored as above described.

In using conventional techniques, small computer systems as well as word processors do not have sufficient storage capacity or processing power to check the spelling of the stored words. There are large, expensive typesetting machines which typically use a 20 to 80 million character mass storage device to actually store an abridged dictionary. These machines use well-known indexing methods to check spelling of stored documents. However, small computers and word processors only have one hundred thousand to two million characters of storage and hence, this memory is not enough to hold a sizable dictionary.

More important is the fact that a small computer or word processor cannot search through long word lists in a reasonable period of time and hence, checking spellings by prior art techniques would be extremely time consuming.

In the prior art, one system employed in conjunction with a small computer attempts to solve the problem by dividing words into lists of prefixes, suffixes and word roots. The time to locate a word root is rather small, therefore the search operation is relatively rapid. However, these techniques do not permit automatic hyphenation and also allow certain invalid words to appear as correctly spelled. For example, a word such as "perfix" is considered to be a correct spelling since "per" is a valid prefix, and "fix" is a valid root. It is therefore an object of the present invention to provide apparatus for use with processing systems and general purpose computers which apparatus will rapidly isolate misspelled words in electronically stored documents.

It is a further object to provide such apparatus to enable documents which are electronically stored to be hyphenated automatically.

BRIEF DESCRIPTION OF THE PREFERRED EMBODIMENT

A spelling error detector apparatus is employed for use in conjunction with a word processor and is capable of detecting spelling errors in alphabetical words which are stored in a memory associated with the word processor and indicative of a document to be printed. The system comprises means for retrieving each word stored and for converting the stored word into a binary word having a given number of bits. Format means rotate a selected number of bits of said word to provide a new word indicative of a code for the stored word. A memory has stored therein a plurality of words which have been coded according to said format. These words are compared with said new word and if a favorable comparison is had, it manifests a correct spelling. If an unfavorable comparison is had, it manifests an incorrect spelling. Selected words of the English language, for example, are converted to a unique code by means of the above format to assure an extremely reliable and accurate detection system so that misspelled words will not be confused with other misspelled words.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a diagrammatic view depicting a memory format according to this invention.

FIG. 2 is a diagram showing an operation performed on a word according to this invention.

FIG. 3 is a top plan view of a disk showing word storage.

FIGS. 4a and 4b are detailed block diagrams and flow charts describing operation of the system.

DETAILED DESCRIPTION OF THE FIGURES

Referring to FIG. 1, there is shown a storage arrangement which comprises the electronic storage of a dictionary employed in this invention.

As seen from FIG. 1, the dictionary according to this invention consists of four parts. A first storage section 10 is designated as most frequently used words (MFU). This list consists of the 6,500 most frequently used words in the English language. Essentially, the word list exists in many abridged dictionaries and can vary between 5,000 and 8,000 words depending on the size of the hardware used.

As will be explained, the four part dictionary to be described is stored on a floppy disk. The next section of the disk 11 is a master word list (MWL). This list consists of 70,000 to 100,000 English words and geographical names. The master list is also obtained from a dictionary of a larger size and is also stored on the same floppy disk. A third list is designated as a personal word list (PWL). This list consists of each individual user's particular additions to the dictionary. For example, the PWL will contain names of company executives, the company name, and technical words such as electronic, chemical or other terms which are used in that particular industry. Storage is reserved for the PWL from 0 to 10,000 words depending upon the user and how many words a particular user desires to include in the PWL. There is also a fourth storage section 13 designated as a collision list. This list resolves the conflicts between the words and codes.

As will be explained, in order to accurately and reliably check spellings one must convert each word of the English language to a digital number or a digital word and then compare the typed words as stored with the digital numbers indicative of the correct spelling as stored in the memory.

In order to avoid multiple errors the code used must provide a minimum number of conflicts. As will be explained, if this number is extremely small then most words will be correctly detected. In the system to be described, the collision list 13 stores about twenty words which are capable of being reduced to the same code as another word.

All four parts of the memory, namely the MFU 10, the MWL 11, the PWL 12 and the collision list 13, are used together during the spelling check operation. The MFU 10 accomplishes high speed checking. The MWL 11 has enough memory capacity to ensure that most of the English language is included. The PWL 12 tailors the dictionary to the end-user's own environment by accepting unusual words common to a profession such as legal or technical terminology. These words are of course added to the memory by the user who formulates the PWL 12. The collision list 13 contains the pairs of words that reduce to the same code. As will be explained, the system to be described uses a unique method of operating on the English language so that the number of words in the collision list is small. As will be explained, the collision list 13 is essential in order to allow the system to provide automatic hyphenation.

As will be explained, in operation of the system the most frequently used words from the MFU 10 are transferred into the word processor memory when a spelling check sequence begins. This is designated by numeral 15 of FIG. 1.

The word processor memory also contains a storage area 16 which stores the last 64 words that are checked (L 64). Another area of storage contains a list 17 of the last 32 words which were misspelled.

In order to explain operation, a few brief comments will be given in regard to system conditions. An extremely important aspect of the system is the technique employed in organizing the word list within the memory. In this system words are grouped by the number of letters in the word. When the system reads the word from a document or from a memory it knows instantly how long the word is. Thus, in this system, if the word to be checked contains nine letters, the system will only check nine letter words. This indexing structure reduces the number of words to be searched by at least 90%. Further refinements include alphabetic arrangement within word length groups and an "s" allowed switch. With the latter, words that form their plural or third person singular with the addition of a single "s" occur once in a word list (example: sit(s), game(s)).
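By way of illustration only, the word length grouping and the "s" allowed switch might be sketched as follows. The names `build_index` and `is_listed` are hypothetical, a Python dictionary stands in for the word length groups, and the rule that a trailing "s" is stripped and the stem looked up in the next shorter group is an assumption; the specification does not state how the switch is consulted.

```python
def build_index(entries):
    """Group (word, s_allowed) pairs by word length (hypothetical layout)."""
    index = {}
    for word, s_allowed in entries:
        index.setdefault(len(word), {})[word.upper()] = s_allowed
    return index


def is_listed(index, word):
    """Look only in the group matching the word's length.

    Assumption: if the word ends in "S" and is not listed itself, the stem
    is looked up in the next shorter group and accepted only when that
    entry's "s" allowed switch is set.
    """
    word = word.upper()
    if word in index.get(len(word), {}):
        return True
    if len(word) > 1 and word.endswith("S"):
        stem = word[:-1]
        return index.get(len(stem), {}).get(stem, False)
    return False


index = build_index([("sit", True), ("game", True), ("child", False)])
print(is_listed(index, "games"))   # True: "game" carries the switch
print(is_listed(index, "childs"))  # False: switch not set for "child"
```

Storing one entry per stem in this way keeps each word length group shorter, which is the stated purpose of the switch.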

In order to increase system speed, the apparatus uses two small word lists which are maintained in the high speed memory during a spelling check of a document. As shown in FIG. 1, list 16 contains the last 64 words checked. This is important as research has shown that unusual or infrequently used words are often repeated in the same paragraph and hence list 16 allows one to rapidly check such words without going through the entire memories 10 to 13.

Memory 17 maintains a list of the last 32 misspelled words. This list is necessary as misspelled words are usually misspelled again. Hence, by providing storage in the high speed memory as 16 and 17, one can achieve a rapid spelling check system to be used with word processors or small computers.
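As a minimal sketch, the two small lists can be modelled as fixed size queues. The specification does not say which entry is discarded when a list is full; dropping the oldest entry is an assumption here, and the names `last_checked` and `last_misspelled` are hypothetical.

```python
from collections import deque

# Assumption: when a list is full, the oldest entry is discarded.
last_checked = deque(maxlen=64)     # stands in for list 16
last_misspelled = deque(maxlen=32)  # stands in for list 17

# Record 70 dummy codes; only the most recent 64 are retained.
for code in range(70):
    last_checked.append(code)

print(len(last_checked), last_checked[0])  # 64 6
```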

In most word processors the input, output and storage functions are provided in the ASCII code. This code is also sometimes referred to as the USASCII. The code is a standard code for information exchange as specified by the USA Standards Institute. Examples of the code are well-known and for example, reference can be had to a text entitled "Reference Data for Radio Engineers" published by Howard W. Sams & Co. (1977), pages 35-45 and 40-25. Hence, in order to operate, each word which is stored in the memory has to be converted so that it can be compared and employed with a minimum number of errors. In this system each letter of a word is translated from the ASCII code to a binary number having a 24 bit value. For example, in the ASCII code, the letter "A" is designated as HEX 41, "B" as 42, "C" as 43 . . . "Z" as 5A, which are translated to HEX 01, 02, 03 . . . 1A. Each letter is translated to a 24 bit binary value. For example, showing only the low-order eight bits, the letter "A" is 0000 0001, the letter "B" is 0000 0010 and so on.
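The translation from ASCII to the values 1 through 26 amounts to subtracting HEX 40 from the upper case ASCII code. A minimal sketch follows; the function name `letter_value` is hypothetical, and rejecting non-alphabetic input is an assumption, since the specification deals only with letters.

```python
def letter_value(ch: str) -> int:
    """Translate one letter to 1..26 (A=HEX 01 ... Z=HEX 1A)."""
    upper = ch.upper()
    # Assumption: anything other than a single letter A-Z is rejected.
    if len(upper) != 1 or not ("A" <= upper <= "Z"):
        raise ValueError(f"not a letter: {ch!r}")
    return ord(upper) - 0x40


# The value occupies the low-order bits of a 24 bit word.
print(format(letter_value("B"), "024b"))  # 000000000000000000000010
```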

After translating a letter, a 24 bit value, starting at zero, is rotated 5 bits to the left and then the translated letter is added to the rotated 24 bit value (exception: 6 letter words use a 7 bit rotate). This is done for each letter in the word and the resulting binary term is the code for that word. Let us give the following example. Assume that the word to be translated is APPLE:

1. Change all letters to upper case, then translate:

A=1 HEX

P=10 HEX

P=10 HEX

L=0C HEX

E=05 HEX

2. Rotate the 24 bit value

00 00 00 (the first time, it stays 0)

3. Add the "A" or 01 HEX

00 00 01

4. Rotate (5 bits, since only 6 letter words use a 7 bit rotate)

00 00 20

5. Add the "P" or 10 HEX

00 00 30

6. Rotate

00 06 00

7. Add another "P" (10 HEX)

00 06 10

8. Rotate

00 C2 00

9. Add the "L" (0C HEX)

00 C2 0C

10. Rotate

18 41 80

11. Add the "E"

18 41 85

Therefore, the word APPLE is stored as 18 41 85. This then is the code for the word APPLE.
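The steps above can be sketched in a few lines, offered as an illustration rather than as the disclosed circuit. The names `rotate_left_24` and `word_code` are hypothetical. Two points are assumptions: bits rotated off the top of the 24 bit value re-enter at the bottom, and the addition wraps modulo 2^24 on overflow, a case the example does not reach. Only alphabetic input is handled.

```python
MASK24 = 0xFFFFFF  # keep every intermediate value within 24 bits


def rotate_left_24(value: int, bits: int) -> int:
    """Rotate a 24 bit value left; high bits re-enter at the low end."""
    bits %= 24
    return ((value << bits) | (value >> (24 - bits))) & MASK24


def word_code(word: str) -> int:
    """Form the 24 bit code: rotate, then add each translated letter."""
    letters = word.upper()
    shift = 7 if len(letters) == 6 else 5  # 6 letter words rotate 7 bits
    code = 0
    for ch in letters:
        code = rotate_left_24(code, shift)
        # Assumption: the addition wraps modulo 2**24 on overflow.
        code = (code + (ord(ch) - 0x40)) & MASK24
    return code


print(format(word_code("APPLE"), "06X"))  # 184185
```

The printed value agrees with the code 18 41 85 obtained in Step 11.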

Referring to FIG. 2, there is shown a schematic diagram depicting the rotation performed by the system in Step 9 to Step 10 to give one a clear understanding of exactly how the code for each word is formed. Thus, as shown in Step 9, after adding the "L", one obtains the binary number 00 C2 0C which is shown in FIG. 2.

Now, because the word APPLE is a 5 letter word, 5 bits are rotated. FIG. 2 shows the rotation of the first 5 bits and the sequence of rotation. Hence, the new word starts from the sixth bit, which is a 0, to form the remainder also shown in FIG. 2, which is the number 18 41 80. The process continues for longer words until each letter has been added to the 24 bit rotating value. Hence, the entire word list is converted by this process to yield, for example, 70,000 codes of 24 bits each, which codes represent the MWL 11. The codes are stored on a floppy disk as shown in FIG. 3.

FIG. 3 is a schematic of a floppy disk 30 having a central aperture 21. In this storage system, the most common word length group, namely words consisting of 8 letters, is stored in the center of the disk. This positioning enhances access and hence increases system speed. The 8 letter word group is surrounded by less frequently occurring word length groups, such as 7 letters and 6 letters above, and 9 letters, 10 letters and so on below. Within each word length group, the 24 bit codes are arranged in ascending numerical order. Hence, the system can use a normal binary search technique to determine the presence or absence of any word.
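Because the codes within a word length group are kept in ascending order, membership can be tested with an ordinary binary search. A minimal sketch using the Python standard library follows; the name `contains_code` is hypothetical, and every code in the sample list other than 18 41 85 is made up for illustration.

```python
from bisect import bisect_left


def contains_code(sorted_codes, code):
    """Binary search an ascending list of 24 bit codes."""
    i = bisect_left(sorted_codes, code)
    return i < len(sorted_codes) and sorted_codes[i] == code


# Illustrative 5 letter group; only 0x184185 (APPLE) comes from the text.
five_letter_codes = sorted([0x01A2B3, 0x184185, 0x2FF001, 0x3C0D11])
print(contains_code(five_letter_codes, 0x184185))  # True
print(contains_code(five_letter_codes, 0x184186))  # False
```

A group of 10,000 codes is resolved in at most 14 comparisons this way, against up to 10,000 for a sequential scan.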

In the system to be described, words are selected from a given document in the sequence that they occur in that document. Each word is converted to all capital letters and then translated into a series of HEX values with "A" represented by 1 and "Z" represented by 1A (or 26 decimal).

The above noted sequence is applied to each word to obtain a 24 bit code. The coded word is then looked up in the MFU list 10, then in the two internal lists 16 and 17. The next check is on the MWL 11. The length of the word is determined and the system accesses the appropriate section of the disk 10 to perform a binary search for a matching 24 bit code. If the match is not found in any of the above memories, then a search is made in the PWL 12. The search ends when a match is found. If a match is not found it is assumed that a word is misspelled and this word is added to the output file of misspelled words and also stored in memory 17.

Depending upon the syntax of a particular word processing system, the operator can use the output file to locate and correct misspellings. Hence, the system will determine whether or not a word is misspelled and leave it to the operator to correct the spelling as necessary. The MWL 11, in addition to the 24 bit code, uses several other codes to indicate preferred hyphenation points. The word processor may pass a word to the system and, if the word is spelled correctly, the system has the capacity of sending hyphenation points back to the word processor. This hyphenation capacity is an optional feature. Hyphenation points are in the form of a series of numbers that designate the number of letters to the next hyphenation point. For example, the word HISTORY would be followed by a 3 and a 2. That would mean a hyphen is allowed after the third letter (reading left to right) and after the next two letters, as HIS-TO-RY. This feature is used to automatically hyphenate a document during a repagination or a print operation.
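The hyphenation point encoding can be illustrated with a short sketch. The function name `hyphenate` is hypothetical, and the validation of the counts is an assumption added for safety; the specification only describes the meaning of the numbers.

```python
def hyphenate(word, points):
    """Insert hyphens using counts of letters to the next hyphenation point."""
    # Assumption: counts must be positive and leave letters after the last hyphen.
    if any(n <= 0 for n in points) or sum(points) >= len(word):
        raise ValueError("hyphenation points do not fit the word")
    pieces = []
    start = 0
    for count in points:
        pieces.append(word[start:start + count])
        start += count
    pieces.append(word[start:])  # letters remaining after the last point
    return "-".join(pieces)


print(hyphenate("HISTORY", [3, 2]))  # HIS-TO-RY
```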

Referring to FIG. 4, there is shown a schematic diagram of a spelling error detector apparatus according to this invention, which apparatus operates with the memories and codes described above in conjunction with FIGS. 1, 2 & 3. Numeral 30 references a keyboard which is a conventional keyboard associated with a word processor or small computer. Keyboards such as 30 are known in the art and many examples exist in word processors, as well as conventional typewriters. As the keyboard is accessed, the document is typed and each letter is stored in a disk or memory device 31 associated with the word processor. Examples of suitable devices such as 31 are well-known in the art, as well as techniques for storing data therein. Each time a key is struck it is stored in the memory as an ASCII code, which is depicted in module 32 for a clearer understanding. It is of course understood that the majority of word processors do exactly this, as is known in the art.

The memory 33 of the word processor may be a disk storage or other device which in turn stores each letter of every word contained in the document. The first sequence of operation that the system performs is to retrieve the first word from the storage 33. Hence, module 34 is an address register which is sequentially operated to retrieve each word and letters stored in the disk storage device 33. Module 34 is a conventional digital logic circuit which will access the disk storage memory to retrieve any given word. The word retrieved is stored in module 34 in a register and the first letter of the word is accessed and transferred to register 35. This letter is converted to a number between 1 and 26 as indicated above. Hence, letter A=1 and so on as described in module 37.

The letter is converted to a binary number in module 38. The conversion of any code to a binary code is well-known in the art and hence, each letter is converted as shown in module 38 so that it is represented by the proper binary value. Simultaneously with the binary conversion, a 24 bit register 40 is rotated 5 times to the left. If the word has 6 letters it is rotated 2 times more as shown in module 41. The registers depicted in 40 and 41 employ conventional techniques. The converted letter is then added to the contents of the binary register in module 42, which module is a conventional digital adder. The rotation as given in the above example is repeated for each letter in a word as depicted in module 41. The sequence is repeated as explained above until each letter of the word has been rotated and added, at which point the word has been converted to the system code. When all letters have been processed, the system automatically retrieves the first letter of the next word and the process continues for the next word. The final value of the rotated word is stored in a 24 bit register 50. If hyphenation is requested the operator will press a key which will activate a flip flop 51. If hyphenation is not requested then a spelling check for that word is made.

Let us first assume that hyphenation is not requested. Thus, the system implements a search of the memories as indicated in module 52. The first memory accessed is the MFU memory 10. The memory is accessed by means of a binary search. The 24 bit word is then compared with the 24 bit words stored in the MFU list. The binary search starts from the center of the list and the word in register 50 is compared with the binary value of the center word. If it is higher, then the search commences from the center word to larger binary numbers. If it is lower, then the search commences from the center to the lower binary numbers. If the word is found in the MFU list it is stored in the last 64 word list 16 as indicated in module 53. If the word is not found in the MFU list 10 then a sequential search is made of the 64 word list 16. A sequential search is not a binary search; instead, the 24 bit word is compared with each of the 64 words stored in list 16 in sequence. If the word is found in list 16, it is retained therein as indicated by module 54. If the word is not found in memory list 16, a sequential search of memory 17 is made. Memory 17 contains the list of the last 32 misspelled words. If the word is found in list 17, it is a misspelled word and is therefore added to the word error file 57. If the word is not found in memory 17, it is not necessarily misspelled and is not added to the error file 57 at this time.

The word error file 57 may be a conventional memory such as a disk, cassette and so on or may be part of the word processor memory. In any event, if the stored word is not found in memory 17, the master word list or MWL memory 11 is accessed in a binary search. Hence, the 24 bit word is again checked with the 70,000 words in MWL 11. If it is found in MWL 11, it is added to memory 16 as indicated by module 58. If it is not found in MWL 11, a search of the personal word list 12 is made. If it is in the personal word list 12, it is again added to memory 16. If it is not in the personal word list then it is assumed that it is misspelled. This word is then inserted into memory 17 and therefore designated as a misspelled word.
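The order of the searches described above can be summarised in a short sketch, offered only as one possible software rendering. The class name `SpellChecker` and its attributes are hypothetical. For brevity the word length grouping is omitted, plain integers stand in for 24 bit codes, and the PWL is assumed to be searched the same way as the MWL, which the specification does not state.

```python
from bisect import bisect_left
from collections import deque


class SpellChecker:
    """Search order: MFU 10, list 16, list 17, MWL 11, then PWL 12."""

    def __init__(self, mfu, mwl, pwl):
        self.mfu, self.mwl, self.pwl = sorted(mfu), sorted(mwl), sorted(pwl)
        self.last_checked = deque(maxlen=64)     # list 16
        self.last_misspelled = deque(maxlen=32)  # list 17
        self.error_file = []                     # error file 57

    @staticmethod
    def _found(sorted_codes, code):
        i = bisect_left(sorted_codes, code)      # binary search
        return i < len(sorted_codes) and sorted_codes[i] == code

    def check(self, code):
        """Return True if the code is taken to be a correct spelling."""
        if self._found(self.mfu, code):
            self.last_checked.append(code)
            return True
        if code in self.last_checked:            # sequential search
            return True
        if code in self.last_misspelled:         # known misspelling
            self.error_file.append(code)
            return False
        if self._found(self.mwl, code) or self._found(self.pwl, code):
            self.last_checked.append(code)
            return True
        self.last_misspelled.append(code)        # newly found misspelling
        self.error_file.append(code)
        return False


checker = SpellChecker(mfu=[1, 2], mwl=[10, 20], pwl=[99])
print(checker.check(10), checker.check(55))  # True False
```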

As one can ascertain, each time a correct word is accessed, it is added to memory 16 as a last checked word. The process then begins anew and the next word is retrieved from the disk storage 33 via module 34 until each word in the document has been verified.

Referring back to the diagram, if hyphenation is requested, a search of the MWL 11 is made via module 60. As indicated, hyphenation points are stored in the MWL 11 and the PWL 12 by numbers which follow the word. If the word is found in the MWL 11, it is compared with the words stored in the collision list memory 13. This is done in module 61 to avoid hyphenating a word which may be improper. If the word is in the collision list 13, this will be indicated to the operator via module 62. The operator will then resolve the collision by correcting the spelling via the word processor. If this is done, the system will then send the hyphenation points as shown in module 63.

Thus, as shown above, the system described can compare any stored word with 24 bit words as stored in the above described memories to determine whether or not the word may be misspelled. Based on the above described code it has been determined that the system provides a collision rate of less than 0.025%. For example, in every 4,000 misspelled words which are detected by the system, only one word will be reduced to the same code as another misspelled word. Due to the organization of the system, a word search is rapidly performed as the system reserves the longest search, for example, the search of the MWL as the last search.

As one can ascertain from the above description, the entire system can be simply implemented by conventional integrated circuit logic and conventional memory elements. These components are readily available and one skilled in the art will have no difficulty in implementing the system described. It is also understood that the entire system can be programmed by using a conventional microprocessor. The flow chart depicted in FIG. 4 can be implemented with a microprocessor. The minimum requirements are a 64K memory, two floppy disk drives (one for the dictionary and one for the document), a video display terminal and a central processing unit such as the Z-80 microprocessor. The system provides the operator with an indication of all misspelled words within a document and the entire content of memory 17 or the error file memory can be displayed for each document. In this manner, the operator employs the system to automatically check spelling and can therefore correct spelling errors before the document is finally printed out. This of course saves a great deal of time and effort and substantially enhances the capability of the word processor.

Many alternate techniques and modifications will become apparent to those skilled in the art upon reading the above specification and all such modifications are deemed to be encompassed within the spirit and scope of the following claims.

I claim:
 1. A spelling error detector apparatus for use in detecting spelling errors in alphabetical words stored in a text memory associated with a word processor and indicative of a document to be printed from the stored words, comprising: a code storage memory having stored therein a plurality of digital codes, each code derived from a separate alphabetical word by first translating all letters in said word to a first digital code, then forming an all zero digital word of a given number of fixed bits, then adding the first code of the first letter to said digital word and then rotating said added word a given number of bits and then adding the first code of the second letter to said rotated word and continuing said rotation and addition for all letters in said word to form said code as stored in said memory, means for retrieving a word as stored in said text memory associated with said processor, means for converting said retrieved word into a code according to said code as stored in said code storage memory including means for translating each letter of said stored word into a first digital code; register means for forming an all zero digital word and adding means for adding said first digital code of said first letter of said retrieved word to said all zero word and means for rotating said added word said given number of bits and then adding said first digital code of said second letter to said rotated word and means for continuing said rotation and addition for all letters in said retrieved word to form a code indicative of said stored codes, means for comparing said code formed as indicative of said retrieved word with said stored codes with a favorable comparison manifesting a correct spelling and with an unfavorable comparison manifesting an incorrect spelling.
 2. The spelling error detector according to claim 1 wherein said storage code memory comprises, a first storage section for storing therein said codes indicative of most frequently used alphabetical words, a second storage section for storing therein said codes indicative of a master word list indicative of a majority of other alphabetical words, a third storage section for storing therein said codes indicative of a personal word list peculiar to a particular field.
 3. The spelling error detector according to claim 1 further including; first misspelled storage means responsive to said unfavorable comparison for storing therein said new word code upon an unfavorable comparison.
 4. The spelling error detector according to claim 1 further including, second properly spelled storage means responsive to said favorable comparison for storing therein said new word code upon a favorable comparison.
 5. The spelling error detector according to claim 1 wherein each word as coded is represented by 24 binary bits.
 6. The spelling error detector according to claim 2 wherein said first storage location is capable of storing between 5000 and 8000 words.
 7. The spelling error detector according to claim 2 wherein said second storage location is capable of storing between 70,000 and 100,000 words.
 8. The spelling error detector according to claim 2 wherein said third storage location is capable of storing between 1 and 10,000 words.
 9. The spelling error detector according to claim 3 wherein said misspelled storage means is capable of storing at least 32 words.
 10. The spelling error detector according to claim 4 wherein said second properly spelled storage means is capable of storing at least 64 words.
 11. The spelling error detector according to claim 1 wherein said code storage memory is a disk memory.
 12. A method of coding alphabetical words of a given number of letters into digital words for storage in a memory for use in a misspelled word detector apparatus for providing a list of stored words each having a unique code for providing a low collision rate for said misspelled detector apparatus comprising the steps of: converting each letter of said word into a binary code having a given number of bits, forming an all zero binary word of a given bit length, adding the binary code of said first letter to said formed word to form a new word, rotating a given number of bits of said new word to form a rotated word, adding the binary code of the second letter to said rotated word to form a new word, rotating said new word, said given number of bits to form a new rotated word, adding the next binary code of the next letter to said new rotated word, and repeating said steps of adding and rotating until all the letters of said word are used, storing said final obtained word as a code uniquely indicative of said alphabetical word, and then comparing said stored final obtained word code with similarly coded and stored alphabetical words indicative of a document to be printed in a misspelled word detector apparatus to detect a possible misspelling of said indicative alphabetical words.
 13. The method according to claim 12 wherein said binary code is an ASCII code.
 14. The method according to claim 12 wherein said word of a given bit length is 24 bits.
 15. The method according to claim 14 wherein six letter words are rotated to the left using 7 bits.
 16. The method according to claim 14 wherein all other letter words are rotated to the left using 5 bits.
 17. The method according to claim 12 wherein the step of storing is storing said code on a floppy disk.
 18. The method according to claim 17 wherein most common alphabetical length words as words having eight letters are stored on said disk as coded via said code in a concentric annular area located midway between the center and the outside edge of said disk with smaller length words stored in a concentric outer area and longer length words stored in a concentric inner area.
 19. The method according to claim 18 wherein said most common alphabetical length words are 8 letter words.
 20. The method of coding according to claim 12 further comprising the steps of: placing an arbitrary alphabetical word into a storage location, converting each letter of said word into a binary code and thereafter performing the steps of claim 13 to form a new arbitrary coded word, comparing the code of said new word with the codes of all final stored words to find a match indicative of a correct spelling.
 21. The method of coding according to claim 20 further including the step of: indicating when all final stored words do not compare with said arbitrary new word indicative of a misspelled word.
 22. The method of coding according to claim 21 further including the step of storing said code indicative of said misspelled word in a separate memory.